ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING
Internal identifier: 000061 (Main/Exploration); previous: 000060; next: 000062
Authors: Marthe Lagarrigue [France]; Florence Rossant [France]; Alain Pierrot [France]; Joël Gardes [France]; Christophe Maldivi [France]; Eric Petit [France]
Source:
English descriptors
- mix: Digital edition, OCR, correction protocol, crowdsourcing, quality assessment, semantics, semiotics
Abstract
Digital re-publishing of documents has become an important issue. Optical Character Recognition (OCR) is widely used for this purpose: it transcribes text images into electronic files, which enables display, indexing, enrichment and distribution. However, OCR software still fails in many configurations, so the transcription does not reach the required editorial quality (a recognition rate of 99% is required for comfortable reading). In the OZALID project, we propose to rely on crowdsourcing to correct OCR results. A main issue is then to determine when crowdsourcing has reached its limits. To that end, we present a feasibility study of an original protocol based on indicators that quantify recognition quality in both semantic and semiotic terms. These indicators are computed and tracked throughout the crowdsourcing process until they stabilize. Experimental results show that the proposed observables converge after a number of correction iterations, which makes it possible to stop the crowdsourcing process automatically and to handle large amounts of data.
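The abstract describes tracking quality indicators across correction iterations and stopping once they are stable. The following is a minimal Python sketch of such a stopping rule, not the authors' protocol: the abstract does not define the indicators, so the values in `history` are invented for illustration, and the function names, the `window` size and the `tol` threshold are all assumptions.

```python
from typing import Optional, Sequence


def has_converged(values: Sequence[float], window: int = 3, tol: float = 1e-3) -> bool:
    """Return True when the last `window` values span a range of at most `tol`.

    `window` and `tol` are illustrative assumptions, not values from the paper.
    """
    if len(values) < window:
        return False
    recent = values[-window:]
    return max(recent) - min(recent) <= tol


def first_stable_iteration(
    values: Sequence[float], window: int = 3, tol: float = 1e-3
) -> Optional[int]:
    """Return the 0-based index of the first iteration at which the series is stable.

    Returns None when the series never stabilizes within the given values.
    """
    for end in range(1, len(values) + 1):
        if has_converged(values[:end], window, tol):
            return end - 1
    return None


# Hypothetical indicator values, one per crowdsourcing correction iteration.
history = [0.950, 0.972, 0.985, 0.9905, 0.9910, 0.9912]

stop_at = first_stable_iteration(history)
print(stop_at)
```

With these made-up values the last three measurements differ by less than `tol`, so the rule would stop at the final iteration. In practice one such series would be tracked per indicator, and the process would stop only when all of them are stable.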
Url: https://hal.archives-ouvertes.fr/hal-01075265
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream Hal, to step Corpus: 000016
- to stream Hal, to step Curation: 000016
- to stream Hal, to step Checkpoint: 000021
- to stream Main, to step Merge: 000061
- to stream Main, to step Curation: 000061
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING</title>
<author><name sortKey="Lagarrigue, Marthe" sort="Lagarrigue, Marthe" uniqKey="Lagarrigue M" first="Marthe" last="Lagarrigue">Marthe Lagarrigue</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING"><orgName>ISEP</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-324747" type="direct"><org type="institution" xml:id="struct-324747" status="INCOMING"><orgName>institut supérieur d'éléctronique de paris</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Rossant, Florence" sort="Rossant, Florence" uniqKey="Rossant F" first="Florence" last="Rossant">Florence Rossant</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING"><orgName>ISEP</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-324747" type="direct"><org type="institution" xml:id="struct-324747" status="INCOMING"><orgName>institut supérieur d'éléctronique de paris</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Pierrot, Alain" sort="Pierrot, Alain" uniqKey="Pierrot A" first="Alain" last="Pierrot">Alain Pierrot</name>
<affiliation wicri:level="1"><hal:affiliation type="department" xml:id="struct-388118" status="INCOMING"><orgName>I2S</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-388117" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-388117" type="direct"><org type="institution" xml:id="struct-388117" status="INCOMING"><orgName>I2S</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Gardes, Joel" sort="Gardes, Joel" uniqKey="Gardes J" first="Joël" last="Gardes">Joël Gardes</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID"><orgName>Orange Labs [Grenoble]</orgName>
<desc><address><addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-366011" type="direct"><org type="institution" xml:id="struct-366011" status="INCOMING"><orgName>Orange</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Maldivi, Christophe" sort="Maldivi, Christophe" uniqKey="Maldivi C" first="Christophe" last="Maldivi">Christophe Maldivi</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID"><orgName>Orange Labs [Grenoble]</orgName>
<desc><address><addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-366011" type="direct"><org type="institution" xml:id="struct-366011" status="INCOMING"><orgName>Orange</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Petit, Eric" sort="Petit, Eric" uniqKey="Petit E" first="Eric" last="Petit">Eric Petit</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID"><orgName>Orange Labs [Grenoble]</orgName>
<desc><address><addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-366011" type="direct"><org type="institution" xml:id="struct-366011" status="INCOMING"><orgName>Orange</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01075265</idno>
<idno type="halId">hal-01075265</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01075265</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01075265</idno>
<date when="2014-11-01">2014-11-01</date>
<idno type="wicri:Area/Hal/Corpus">000016</idno>
<idno type="wicri:Area/Hal/Curation">000016</idno>
<idno type="wicri:Area/Hal/Checkpoint">000021</idno>
<idno type="wicri:Area/Main/Merge">000061</idno>
<idno type="wicri:Area/Main/Curation">000061</idno>
<idno type="wicri:Area/Main/Exploration">000061</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING</title>
<author><name sortKey="Lagarrigue, Marthe" sort="Lagarrigue, Marthe" uniqKey="Lagarrigue M" first="Marthe" last="Lagarrigue">Marthe Lagarrigue</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING"><orgName>ISEP</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-324747" type="direct"><org type="institution" xml:id="struct-324747" status="INCOMING"><orgName>institut supérieur d'éléctronique de paris</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Rossant, Florence" sort="Rossant, Florence" uniqKey="Rossant F" first="Florence" last="Rossant">Florence Rossant</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-102803" status="INCOMING"><orgName>ISEP</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-324747" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-324747" type="direct"><org type="institution" xml:id="struct-324747" status="INCOMING"><orgName>institut supérieur d'éléctronique de paris</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Pierrot, Alain" sort="Pierrot, Alain" uniqKey="Pierrot A" first="Alain" last="Pierrot">Alain Pierrot</name>
<affiliation wicri:level="1"><hal:affiliation type="department" xml:id="struct-388118" status="INCOMING"><orgName>I2S</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-388117" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-388117" type="direct"><org type="institution" xml:id="struct-388117" status="INCOMING"><orgName>I2S</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Gardes, Joel" sort="Gardes, Joel" uniqKey="Gardes J" first="Joël" last="Gardes">Joël Gardes</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID"><orgName>Orange Labs [Grenoble]</orgName>
<desc><address><addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-366011" type="direct"><org type="institution" xml:id="struct-366011" status="INCOMING"><orgName>Orange</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Maldivi, Christophe" sort="Maldivi, Christophe" uniqKey="Maldivi C" first="Christophe" last="Maldivi">Christophe Maldivi</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID"><orgName>Orange Labs [Grenoble]</orgName>
<desc><address><addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-366011" type="direct"><org type="institution" xml:id="struct-366011" status="INCOMING"><orgName>Orange</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Petit, Eric" sort="Petit, Eric" uniqKey="Petit E" first="Eric" last="Petit">Eric Petit</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-242003" status="VALID"><orgName>Orange Labs [Grenoble]</orgName>
<desc><address><addrLine>28 Chemin du Vieux Chêne, 38240 Meylan</addrLine>
<country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-366011" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-366011" type="direct"><org type="institution" xml:id="struct-366011" status="INCOMING"><orgName>Orange</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="mix" xml:lang="en"><term>Digital edition</term>
<term>OCR</term>
<term>correction protocol</term>
<term>crowdsourcing</term>
<term>quality assessment</term>
<term>semantics</term>
<term>semiotics</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Digitized re-publishing of documents has become nowadays a very important issue. Optical Character Recognition (OCR) has been intensively used to this aim, as it performs the transcription of the text images into electronic files, allowing display functionalities, indexation, enrichment and broadcasting. However, such software still fails in many configurations, so that the transcription does not reach the required editorial quality (99% of recognition are required for an ergonomic reading). In the OZALID project, we propose to rely on crowdsourcing for correcting OCR results. One main issue is then to determine when the crowdsourcing has reached its limits. For that, we present a feasibility study of an original protocol based on indicators that quantify the recognition quality in both semantic and semiotic ways. These indicators are calculated and followed up during the entire crowdsourcing process until stability. Experimental results show that the proposed observables converge after some correction iterations allowing automatically stopping the crowdsourcing process and dealing with huge amount of data.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
</list>
<tree><country name="France"><noRegion><name sortKey="Lagarrigue, Marthe" sort="Lagarrigue, Marthe" uniqKey="Lagarrigue M" first="Marthe" last="Lagarrigue">Marthe Lagarrigue</name>
</noRegion>
<name sortKey="Gardes, Joel" sort="Gardes, Joel" uniqKey="Gardes J" first="Joël" last="Gardes">Joël Gardes</name>
<name sortKey="Maldivi, Christophe" sort="Maldivi, Christophe" uniqKey="Maldivi C" first="Christophe" last="Maldivi">Christophe Maldivi</name>
<name sortKey="Petit, Eric" sort="Petit, Eric" uniqKey="Petit E" first="Eric" last="Petit">Eric Petit</name>
<name sortKey="Pierrot, Alain" sort="Pierrot, Alain" uniqKey="Pierrot A" first="Alain" last="Pierrot">Alain Pierrot</name>
<name sortKey="Rossant, Florence" sort="Rossant, Florence" uniqKey="Rossant F" first="Florence" last="Rossant">Florence Rossant</name>
</country>
</tree>
</affiliations>
</record>
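The record above can also be read outside Dilib. The sketch below pulls the keyword terms and an identifier out of the record text with regular expressions. It is an illustration under stated assumptions: the helper names `extract_terms` and `extract_idno` are hypothetical, and a regex is used because the record as shown carries `wicri:` and `hal:` prefixes without namespace declarations, which a strict XML parser may reject.

```python
import re
from typing import List, Optional

# Keyword block and one identifier copied from the record above.
RECORD_FRAGMENT = """<idno type="halId">hal-01075265</idno>
<keywords scheme="mix" xml:lang="en"><term>Digital edition</term>
<term>OCR</term>
<term>correction protocol</term>
<term>crowdsourcing</term>
<term>quality assessment</term>
<term>semantics</term>
<term>semiotics</term>
</keywords>"""


def extract_terms(xml_text: str) -> List[str]:
    """Return the text of every <term> element, in document order."""
    found = re.findall(r"<term>(.*?)</term>", xml_text, flags=re.DOTALL)
    return [term.strip() for term in found]


def extract_idno(xml_text: str, idno_type: str) -> Optional[str]:
    """Return the content of the first <idno> with the given type, or None."""
    pattern = r'<idno type="%s">(.*?)</idno>' % re.escape(idno_type)
    match = re.search(pattern, xml_text)
    return match.group(1) if match else None


print(extract_terms(RECORD_FRAGMENT))
print(extract_idno(RECORD_FRAGMENT, "halId"))
```

For anything beyond a quick lookup, declaring the missing namespaces and using a real XML parser would be more robust than pattern matching.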
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000061 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000061 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Hal:hal-01075265 |texte= ASSESSING THE QUALITY OF DIGITAL RE-PUBLISHING OF TEXTUAL DOCUMENTS THROUGH THE FOLLOW-UP OF A CORRECTION PROTOCOL BY CROWDSOURCING }}
This area was generated with Dilib version V0.6.32.